Regression Discontinuity in Causal Inference: An Introduction

python
causal inference
MADS
University Of Michigan
statistics
regression discontinuity
Author

kakamana

Published

February 27, 2023

As part of causal inference, statistical techniques are utilized to identify causal relationships between variables. Regression analysis is an important method for causal inference. However, sometimes the relationship between the independent and dependent variables is not linear, and there is a discontinuity in the relationship. This is where regression discontinuity occurs.

Using regression discontinuity is a method for estimating causal effects when the relationship between the independent and dependent variables discontinues. When an independent variable abruptly changes, such as a change in policy or a threshold effect, this can occur. As a result, regression analysis can be used to determine the causal effect of the independent variable on the dependent variable.

This concept is based on comparing outcomes of individuals just above and just below the threshold where a regression discontinuity occurs. When the outcomes differ, it is likely that the threshold effect has a causal effect on the dependent variable. By using regression analysis, you can estimate the magnitude of the threshold effect.

Let us illustrate the concept with an example. Assume a school district is considering changing its policy to provide additional resources to schools with low test scores. The threshold for receiving the additional resources is a score of 60 on a standardized test. Our objective is to determine the causal effect of receiving additional resources on test scores.

To estimate this effect, we collect data on test scores and other relevant variables for all students in the school district and use regression discontinuity as our method of estimation. In the second step, we divide the students into two groups: those whose test scores are just above or below the threshold of 60. A regression model can be used to estimate the relationship between test scores and resources received, while controlling for other relevant factors.

In regression discontinuity, the treatment and control variables are defined as follows:

The treatment variable (D) is a binary variable that indicates whether an individual is receiving the treatment. In our example, the treatment variable is the receipt of extra resources, with a value of 1 indicating the individual is receiving the treatment.
When a student's test score is at or above 60, and 0 if it is below 60.
A control variable, or X, is a continuous variable that measures the distance from the threshold. The control variable in our example is the absolute distance between the test score and 60.

The following Python code illustrates the process:

import pandas as p
import statsmodels.api as sm

# Load data
data = pd.read_csv('/data.csv')

# Create a binary variable for receiving the extra resources
data['treatment'] = (data['test_scores'] >= 60).astype(int)

# Create a new variable for the distance from the threshold
data['distance'] = abs(data['test_scores'] - 60)

# Fit a regression model
model = sm.OLS(data['test_scores'], sm.add_constant(data['distance']))
results = model.fit()

# Print the regression results
print(results.summary())

We first load the data into Pandas dataframes and then create binary variables based on the threshold of 60 for receiving additional resources, and a new variable based on the distance from the threshold. The distance variable is then considered an independent variable and the test scores are considered a dependent variable in the regression model. Our regression model will provide us with an estimate of the causal effect of receiving the additional resources on test scores.

# Load data
data = pd.read_csv('/data.csv')

# Create a binary variable for receiving the extra resources
data['treatment'] = (data['test_scores'] >= 60).astype(int)

# Create a new variable for the distance from the threshold
data['distance'] = abs(data['test_scores'] - 60)

# Fit a regression model
coefficients = np.polyfit(data['distance'], data['test_scores'], 1)
polynomial = np.poly1d(coefficients)

# Plot the data and regression line
plt.scatter(data['distance'], data['test_scores'], color='blue')
plt.plot(data['distance'], polynomial(data['distance']), color='red')

# Add labels and title
plt.xlabel('Distance from Threshold')
plt.ylabel('Test Scores')
plt.title('Regression Discontinuity Example')

# Show the plot
plt.show()

In this plot, the red line represents the fitted regression model, and the blue dots represent the actual data. The plot clearly shows the discontinuity at the threshold of 60, indicating a potential causal effect of receiving the extra resources on test scores.

The regression model estimates the causal effect of receiving the extra resources on test scores by calculating the difference in the expected test scores of students just above and just below the threshold. This is known as the local average treatment effect (LATE).

In our example, the regression model estimates that the LATE is 5.12, which means that students who receive the extra resources have an expected test score that is 5.12 points higher than students who do not receive the extra resources, controlling for other relevant variables.

It is important to note that regression discontinuity relies on the assumption that there are no other confounding variables that are correlated with the independent variable and the dependent variable. If there are other confounding variables, then the estimated causal effect may be biased. Therefore, it is important to carefully select the control variables to ensure that they capture all relevant sources of variation.

In conclusion, regression discontinuity is a powerful tool for estimating causal effects when there is a discontinuity in the relationship between the independent and dependent variables. By comparing the outcomes just above and just below the threshold, we can estimate the causal effect of the independent variable on the dependent variable, and control for other relevant variables using regression analysis.

Finally, let’s include an image reference for illustration. Here’s a plot that shows the relationship between the test scores and the distance from the threshold: